Abstract

The  purpose  of  this  investigation  was  to  identify  phoneme 
segments  as  they  appeared  in  continuous  speech.  The  input 
device  was  an  audio  tape  recorder  from  which  the  analog 
speech  signal  was  digitized  and  fast  Fourier  transformed. 

The  amplitudes  of  this  transformed  signal  were  combined  in  a 
logarithmic  manner  and  printed  out  in  a 16  channel  digitized 
spectrogram.  Sixty-one  prototypes  were  selected  to  represent 
the  phonemes  of  the  English  language.  These  prototypes  were 
stored  and  used  in  a running  crosscorrelation  with  the  unknown 
speech signal. The amplitude values resulting from the correlation
process were used to predict phoneme locations and the
values  were  compared  in  order  to  identify  the  correct  phoneme. 

The  phonemes  were  selected  from  Speaker  A's  speech  signal 
and  tests  were  conducted  to  analyze  utterances  from  Speaker  A 
and Speaker B. For Speaker A, location was rated at 81 percent
while identification was rated at 45 percent. For Speaker B,
location was found to be 70 percent with identification
at 40 percent.

Spatial  filtering  techniques,  uniform  length  prototypes, 
and  various  normalization  procedures  were  investigated  next 
with  the  result  of  improving  location  for  Speaker  B. 
COMPUTER IDENTIFICATION
OF  PHONEMES 
IN  CONTINUOUS  SPEECH 

I.  Introduction 

This  thesis  is  a continuation  of  a study  begun  by 
R.W.  Neyman  (Ref  21).  The  long  term  goal  of  this  study  is 
to advance the possibility of unrestricted speech recognition
by machine. For centuries man has dreamed of building
machines  that  could  hear  and  speak  the  language  of  men  (Ref 
18:45).  For  more  than  three  decades,  concentrated  efforts  of 
combined  scientific  disciplines  have  been  expended  to  solve 
this  problem.  None  have  been  successful  in  understanding 
continuous speech. Some men, after spending years of fruitless
effort, have grown so discouraged as to label all energy
spent  in  this  area  as  wasted  time  (Ref  23:41).  Others  have 
said,  "Engineers  working  in  this  area  with  continuous  speech 
recognition  in  mind,  have  a right  to  be  discouraged"  (Ref 
18:58).  Nevertheless,  just  as  men  watching  birds  fly  for 
centuries  were  inspired  to  countless  trials  before  success, 
the  mere  fact  that  man  can  understand  continuous  speech  in  a 
variety  of  environments,  motivates  the  attempt  to  build  a 
machine  that  can  achieve  the  same  result. 
Motivation


In  experiments  involving  speech  and  other  communication 
modes  like  typing,  information  is  transferred  almost  twice  as 
fast  with  speech  as  without  speech  (Ref  28:41).  Neyman  (Ref 
21:2)  computes  the  rate  of  information  transfer  in  speech  to 
be  on  the  order  of  50  bits/second  based  on  Flanagan's  estimate 
of  approximately  five  bits  of  information  per  phoneme  (Ref 
7:4). Besides speed, other advantages of being able to communicate
verbally with machines are constantly being expounded.
Man will have both hands free to do required work while
actively passing on information to the system, and a substantial
amount of training can be eliminated in the man-machine
interface area.

Background 

While several isolated word recognition systems for small
vocabularies with known speakers are commercially available,
it may be years before machines can recognize normal conversational
speech (Ref 28:40). The problems associated with
understanding continuous speech are much more complex
than those of isolated speech. Experiments indicate that
one-fourth to one-half of the words in normal conversational
speech are unintelligible when taken out of context and heard
in isolation (Ref 28:41). This seems to indicate that a
system for understanding continuous speech must, of necessity,
use context related rules. In fact, psychoacoustic experiments
show that listeners use semantic, syntactic, prosodic, pragmatic,
and acoustic knowledge to understand acoustically
corrupted speech (Ref 19 and Ref 27). Whether one accepts
this theory or not, it seems clear that some system must be
employed that can "hear" and perform a one-to-one mapping to
a perception space so that the system can "know" what it
heard even if additional analysis must be performed before
the meaning and use are determined. A look at some current
methods of analysis reveals that memory requirements limit
the efficiency of today's systems.

Since the most accurate system of isolated word recognition
available today uses template matching techniques (Ref
11  and  Ref  29),  it  seems  reasonable  to  consider  the  amount  of 
memory  required  to  represent  various  breakdowns  of  the  English 
language.  Table  I (Ref  8:91)  shows  the  relative  frequency  of 
occurrence  of  sounds  and  words  in  ordinary  spoken  English. 

One  can  see  that  732  words  constitute  75  percent  of  the  words 
used  in  normal  conversational  speech,  whereas  only  19  sounds 
are  required  to  make  up  the  same  percentage  of  total  sounds 
used. 
TABLE I

Relative Frequency of Usage of Sounds and Words

Number of Sounds    % of Time Used    Number of Words
       4                 25                  9
       9                 50                 69
      19                 75                732
      --                 78.6             1027
      40+               100                 --

As  long  as  the  total  number  of  words  is  small,  memory 
considerations  will  not  be  a prime  factor,  but  for  continuous 
speech  systems  with  sizable  vocabularies,  a more  efficient 
coding  or  decomposition  system  would  be  to  use  the  phonemes 
as  prototypes.  This  approach  has  considerable  appeal,  and 
much  of  the  automatic  speech  research  has  concerned  automatic 
phoneme  recognizers  (Ref  28:48).  Even  systems  that  use  stored 
word templates could profit from a reliable phoneme recognizer
to reduce the amount of time for template matching by
selective recall of stored words (Ref 28:45). The ultimate
hope, of course, is that a phoneme recognizer would eliminate
the need for word prototype storage.

Another  motivating  force  to  use  the  phonemic  breakdown 
and  prototype  storage  is  the  ease  with  which  a correlation 
procedure can be implemented. This process holds the additional
promise of being closely related to the process carried
out  in  the  human  cortex  as  proposed  by  Fano  and  Huggins  (Ref 
15),  Cherry  (Ref  3),  and  Kabrisky  (Ref  12). 

McLachlan  (Ref  20)  demonstrates  a visual  correlator  that 
is  able  to  locate  and  identify  prototypes,  and  Neyman  was 
able  to  construct  an  auditory  analog  of  this  system.  His 
method  was  to  first  construct  a digital  spectrogram  that  would 
display the energy spectrum of successive short time-segments
of speech. It is generally agreed that the information needed
to recognize speech is contained in the spectrum (Ref 17:115).
The spectrogram development is explained fully in Chapter III.
After the spectrogram was developed, prototypes for the various
phonemes of speech were selected and then correlation was
accomplished with decisions based on the maximum crosscorrelation
value that occurred over a specified length of utterance.

Objective 

The objective of this research was to continue the original
investigation begun by Neyman (Ref 21) in order to locate
and identify phonemes in continuous speech using pattern
recognition and crosscorrelation techniques. Neyman achieved
excellent location and identification results using a 10 class
problem. When the prototypes were extended to a 47 class
problem, location dropped only slightly while phoneme identification
fell to 34 percent; however, correct category identification
was only reduced to 62 percent.

In the analysis of results, certain phonemes were not
looked for because Neyman believed that adequate prototypes did
not exist for those sounds. Neyman suggested that follow-on studies
in this area extend the phoneme set to include at least
nasalized vowels and some prototypes from ending and beginning
phonemes that were the same sound but different structure.
He  also  suggested  that  spatial  filtering  techniques  that  had 
proven  successful  in  recognition  of  hand-written  letters  by 
Carl  and  Hall  (Ref  2)  and  in  the  recognition  of  isolated 
words for two speakers by Daily and Sutton (Ref 4) be incorporated
to extract the important information while minimizing
the  "noise"  that  clouds  the  identification  process. 

Scope 

The scope of the project was to expand the set of prototypes
to include nasalized vowels and additional ending and
beginning  sounds  and  at  least  one  combination  sound.  Low-pass 
filtering  was  tried  next  and  required  the  modification  of 
the previously used normalization process. It was also necessary
to select a new set of prototypes of uniform length in
order to use the filtering scheme. Two sets of seven sentences
composed by the author were analyzed with no filtering
applied.  One  set  of  seven  sentences  spoken  by  a speaker  with 
a different  dialect  was  analyzed  with  no  filter  present. 
Low-pass  filters  of  varying  size  were  tested  next.  In  all 
the  cases  analyzed,  a complete  set  of  prototypes  was  assumed. 
The  two  sets  of  prototypes  that  were  used  and  their  key  words 
are  listed  in  Table  II  and  Table  III. 
II. Data Acquisition and Pre-Processing


The  same  recording  equipment  was  used  for  this  study  as 
was  used  by  Neyman  (Ref  21).  The  recorder  used  was  the 
Ampex  Model  F4450  stereo  tape  recorder.  The  recordings  were 
made  in  a very  quiet  room  with  minimum  background  noise. 

These  recordings  were  easily  understood  by  the  human  ear  and 
were  judged  to  be  satisfactory  for  input  to  the  digitization 
equipment.

The  speech  samples  were  recorded  at  a normal  speaking 
level  on  one  channel  of  the  stereo  tape  recorder  while  a 
periodically  interrupted  2000  Hz  tone  was  recorded  on  the 
second  channel.  The  tone,  provided  by  a Model  III  Wavetek 
signal  generator,  was  used  to  indicate  recording  intervals  for 
the  digitization  problem.  A one-second  tone  preceded  each 
speech record. The tone was turned off during speech recordings
to eliminate crosstalk between channels.

The  tape  recordings  provided  a permanent  record  in  the 
event  that  the  digitization  process  had  to  be  reaccomplished. 
The  recordings  were  also  an  aid  in  analyzing  the  computer 
representations  of  the  various  speech  samples  to  ascertain 
exactly  which  phonemes  were  uttered.  Another  benefit  of  this 
recording system was that the signals could be recorded at
one speed and then played back at another, thus increasing
the sampling rate in the digitization procedure.

Analog-to-Digital Conversion

The  initial  processing  of  the  analog  speech  signal  was 
accomplished  by  the  Analog/Hybrid  Systems  Branch  of  the  ASD 
Computer  Center  in  the  same  manner  as  processing  of  the  Neyman 
data  (Ref  21:16). 

The recording had been made at a 7 1/2 ips rate. By using
a playback speed of one-half that (3 3/4 ips), the sampling rate was
effectively doubled. The accepted bandwidth of the amplifiers
used in the analog system was 0 to 2500 Hz. The audio signal
was first low-pass filtered to 2500 Hz to ensure a band-limited
signal, and the sampling rate was set at 5 kHz in order to
satisfy the Nyquist sampling criterion. This resulted in an
overall effect of a signal that had been low-pass filtered
to 5 kHz and sampled at 10 kHz.
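
The speed-halving arithmetic above can be checked with a short sketch. The function name and argument names are illustrative assumptions and are not taken from the original processing software; only the tape speeds, the 2500 Hz cutoff, and the 5 kHz sampling rate come from the text.

```python
def effective_rates(record_ips, playback_ips, cutoff_hz, sample_hz):
    """Refer the filter cutoff and sampling rate back to the original speech.

    Playing the tape back slower than it was recorded scales every
    frequency in the signal down by the speed ratio, so a fixed analog
    cutoff and sampling rate act as if they were larger by that ratio.
    """
    speed_ratio = record_ips / playback_ips
    return cutoff_hz * speed_ratio, sample_hz * speed_ratio


# Values stated in the text: 7 1/2 ips recording, 3 3/4 ips playback,
# 2500 Hz low-pass filter, 5 kHz sampling.
eff_cutoff_hz, eff_sample_hz = effective_rates(7.5, 3.75, 2500.0, 5000.0)
print(eff_cutoff_hz, eff_sample_hz)
```

Under these assumptions the result is an effective 5 kHz cutoff and 10 kHz sampling rate, which agrees with the figures quoted above.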

The  sampled  input  signal  was  amplified  to  approximately 
100 volts to make more effective use of the analog representation.
The signal was fed through a Comcor Ci-5100 high speed
interface  to  a Xerox  Sigma  7 general  purpose  digital  computer. 
Signal Transformation

The digitized analog speech data was then converted into
an equivalent frequency representation by using fast Fourier
transform (FFT) techniques. By selecting a relatively narrow
window to input the time domain samples to the FFT, the time
resolution of the transformed signal was enhanced while the
frequency resolution was degraded. This selection was based
on the previous work of Oppenheim (Ref 22:57-62). Neyman
(Ref 21:17-18) selected the window size to be 128 samples in
length and stepped the window through the data in 128 sample
segments. An in-house program called AMPSPC was used by the
Analog/Hybrid System Branch to compute the forward FFT and
return the absolute magnitude of the values computed (Ref
9:42).

Using  the  conjugate  symmetry  property  of  the  FFT,  the 
above  procedure  resulted  in  64  discrete  amplitude  values 
separated by 78.125 Hz. Since the original data was being
sampled at a 10 kHz rate, a 128 sample segment occurred every
128/10^4 sec or 12.8 ms. These sample segments are referred
to  as  "frames".  The  resulting  data  was  converted  into  decimal 
form  by  dividing  by  the  largest  array  value  in  a transformed 
sentence  and  then  written  on  a library  tape  (L-tape)  in  proper 
format  for  the  CDC-6600  computer. 
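
The framing and transformation steps described above can be sketched as follows. This is not the AMPSPC program, whose internals are not given here; it is a minimal illustration that uses a direct discrete Fourier transform in place of an FFT, and it assumes that the 64 retained values are bins 0 through 63. All function names are hypothetical.

```python
import cmath
import math

SAMPLE_RATE_HZ = 10_000   # effective sampling rate stated in the text
FRAME_LEN = 128           # window length, stepped through the data without overlap


def frame_magnitudes(samples):
    """Return the 64 non-redundant DFT magnitudes of each 128-sample frame."""
    frames = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        block = samples[start:start + FRAME_LEN]
        mags = []
        for k in range(FRAME_LEN // 2):   # conjugate symmetry: keep half the bins
            acc = sum(block[n] * cmath.exp(-2j * math.pi * k * n / FRAME_LEN)
                      for n in range(FRAME_LEN))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def scale_to_unit_peak(frames):
    """Divide every value by the largest value found in the whole utterance."""
    peak = max(max(frame) for frame in frames)
    if peak == 0:
        return frames
    return [[value / peak for value in frame] for frame in frames]


bin_spacing_hz = SAMPLE_RATE_HZ / FRAME_LEN    # 78.125 Hz between channels
frame_ms = 1000 * FRAME_LEN / SAMPLE_RATE_HZ   # 12.8 ms per frame

# Test signal: a sine wave centered on channel 10, two frames long.
tone = [math.sin(2 * math.pi * 10 * bin_spacing_hz * n / SAMPLE_RATE_HZ)
        for n in range(2 * FRAME_LEN)]
spec = scale_to_unit_peak(frame_magnitudes(tone))
```

Each row of `spec` is one frame of 64 channel amplitudes scaled to the largest value in the utterance, which mirrors the m x 64 layout written to the L-tape.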
III. Digital Signal Processing


The  pre-processed  information  received  on  L-tape  from 
the  Analog/Hybrid  System  Branch  was  contained  in  an  m x 64 
array.  The  length  of  the  speech  utterance  determined  the 
value  of  m,  the  number  of  frames  in  an  utterance.  Since  each 
frame  represented  12.8  ms  of  the  original  speech  sample,  a 
one-second utterance would have 1/0.0128 or 78 frames. Each
element in a frame was a four-digit decimal number that represented
the  signal  amplitude  in  that  particular  frequency  channel. 
Each  of  the  64  channels  had  78  Hz  separation  between  center 
frequencies  of  adjacent  channels. 

Neyman  (Ref  21:19)  used  a restructuring  of  this  data 
format  in  a manner  that  would  approximate  the  sensitivity  of 
the ear to frequency changes by simulating the logarithmic
nature  of  the  ear  at  frequencies  above  1000  Hz. 


Channel  Compression 

Table  IV  is  included  for  completeness  to  show  how  Neyman 
(Ref  21:20)  grouped  the  frequencies  to  reduce  the  original 
64 channels to 16. Since the energies of the channels were
added in each subgroup, it was not necessary to use standard
preemphasis of 6 dB/octave (Ref 26:311) for the higher frequencies
before processing. Figure 1, by Neyman (Ref 21:21),
is included to show the comparison of frequency distribution
of this system and a vocoder system. Reduction from 64 to
16 channels also reduced the computer storage requirement by
75 percent.
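
The channel compression can be sketched as a sum over contiguous groups of channels. The grouping used below is a hypothetical stand-in for Table IV, which is not reproduced in this chapter: it keeps the 12 channels below about 1000 Hz separate and widens the groups above that point. Only the reduction from 64 to 16 channels and the summing within each subgroup are taken from the text.

```python
# Hypothetical group widths (NOT the values of Table IV): 12 single channels
# below roughly 1000 Hz, then progressively wider groups above it.
GROUP_SIZES = [1] * 12 + [5, 9, 14, 24]


def compress_frame(frame64, group_sizes=GROUP_SIZES):
    """Sum contiguous groups of the 64 input channels into 16 output channels."""
    if len(frame64) != sum(group_sizes):
        raise ValueError("frame length does not match the channel grouping")
    compressed, start = [], 0
    for size in group_sizes:
        compressed.append(sum(frame64[start:start + size]))
        start += size
    return compressed


flat_frame = [1.0] * 64
compressed_frame = compress_frame(flat_frame)
storage_saving = 1 - len(compressed_frame) / len(flat_frame)   # 0.75
```

With a flat input frame, each output channel simply reports the width of its group, which makes the widening of the upper channels easy to see.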

Spectrogram  Development 

Spectrograms are used in pattern recognition to visually
display the frequency content of speech. Although it is not
known exactly what accuracy can be achieved in visually reading
high quality spectrograms, the extensive work of Potter,
Kopp, and Green (Ref 24) indicates that sufficient information
is encoded in the spectrogram to reproduce the
original message. More recent tests on the usefulness of the
spectrogram in continuous speech recognition indicate visual
reading successes of 35 - 100 percent (Ref 13:6). Such high
success  rates  were  attainable  only  when  the  test  subjects 
were  given  additional  cues;  however,  this  is  viewed  as 
comparable  to  a person  listening  to  a message  with  "context" 
and  associated  cues.  If  a spectrogram  contains  sufficient 
information  for  visual  interpretation,  then  it  is  feasible 
that  a computer  may  be  able  to  decipher  the  message. 
Neyman (Ref 21:23) developed a limited-detail digital
spectrogram by using an overprint technique as specified in
Table V. His program printed the spectrogram adjacent to
the 16 channels of numerical data. Each channel had a threshold
for overprint; a round-up procedure was used to form
integer  values  and  these  integer  values  corresponded  to  the 
overprint  "level  of  darkness"  figures  of  Table  V.  Although 
the  array  values  could  be  studied  to  observe  the  energy 
changes  and  locate  low  energy  phonemes,  a more  complete 
depiction  was  considered  to  be  of  great  value. 
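
The threshold and round-up steps of the printed spectrogram can be sketched as follows. The character ramp and the threshold value are hypothetical, since the overprint characters of Table V are not reproduced here, and a single character per cell stands in for true line-printer overprinting.

```python
import math

SHADES = " .:+*#"    # hypothetical darkness ramp, lightest to darkest
THRESHOLD = 0.05     # hypothetical per-channel threshold


def darkness_level(value, threshold=THRESHOLD, levels=len(SHADES) - 1):
    """Map an amplitude in the range 0 to 1 onto an integer darkness level."""
    if value < threshold:
        return 0                                  # below threshold: print nothing
    return min(levels, math.ceil(value * levels)) # round up to the next level


def render_frame(frame):
    """Render one frame of channel amplitudes as a row of characters."""
    return "".join(SHADES[darkness_level(value)] for value in frame)


row = render_frame([0.0, 0.21, 1.0])
```

Printing one such row per frame, next to the numerical channel values, gives a rough picture of the energy pattern in the manner described above.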

Table V

Since this research program ultimately performed energy
normalization before the decision space was reached, an energy
normalized spectrogram was judged invaluable in finding the
individual phonemes. A sample spectrogram of each type is
shown  in  Figure  2.  The  modified  program  OCTAVE  is  included 
in  Appendix  B.  This  program  accomplishes  the  logarithmic 
compression  of  the  original  64  channels  of  data  as  well  as 
the  generation  of  the  energy  normalized  speech  spectrograms. 

The  author  gained  great  insight  into  the  speech  process 
by  studying  the  patterns  of  the  various  spectrograms.  This 
confirms  the  conclusion  of  Klatt  and  Stevens  (Ref  13:27): 

In  conclusion,  it  is  suggested  that  every 
serious  worker  in  the  area  of  automatic  speech 
recognition should undertake to read spectrograms
in an organized way similar to the projects
that we have described. It is an excellent
way  of  learning  a great  deal  about  speech,  and 
it  is  the  only  way  to  convince  yourself  of  the 
complexities  involved  and  of  the  necessity  for 
approaching  the  problem  with  more  sophisticated 
forms  of  analysis. 

Selection  of  Prototypes 

Neyman  (Ref  21:26)  selected  prototypes  from  isolated 
words since this offered a straightforward method
of  selecting  individual  prototypes  with  little  chance  of 
accidentally  combining  frames  of  different  phonemes.  He  also 
suggested  that  the  phoneme-word  be  included  in  a structured 
sentence  (Ref  21:74-75)  to  more  adequately  produce  the  phoneme 
as  it  occurred  in  continued  speech. 
Fig. 2. Non-Normalized and Normalized Spectrograms


The words of interest were embedded in the sentence
"Say (word) instead." The enunciation was very precise in
this  setting  and  tended  to  produce  phonemes  of  length  that 
agreed  very  well  with  predicted  lengths  (Ref  8:59-67).  It 
was  also  evident  that  there  is  great  variability  in  the 
duration of certain sounds. The vowels and vowel-like sounds
showed  the  most  variation  (Ref  26:315).  A good  example  of 
vowel  variation  was  noted  by  the  effect  of  the  following 
consonant  on  the  vowel  length.  The  vowel  tends  to  be  longer 
when  followed  by  a final  voiced  consonant  than  when  followed 
by  a final  voiceless  consonant  (Ref  14:18).  Vowel  length 
tends to fall in the range of 80-360 ms. The consonants are
generally  much  shorter  than  vowels  with  many  being  as  short 
as  70  ms  (Ref  40:19-68  and  Ref  8:59-67). 

It is also important to realize that the energy concentrations
or formants in vowels are constant during the duration
of the sound but must have a beginning and ending transition.
This is true even if the vowel is uttered in isolation
since  it  does  not  start  and  stop  instantaneously.  Therefore, 
selection  of  vowel  prototypes  should  be  made  from  the  steady- 
state  section  of  the  vowels. 

The  pictorial  representations  of  phonemes  from  Potter, 
Kopp, and Green (Ref 24), along with the computer generated
spectrograms  and  an  audio  tape  of  the  utterances,  facilitated 
the  selection  of  phoneme  prototypes. 

There  were  at  least  three  different  sets  of  prototypes 
chosen.  The  first  set  followed  very  closely  the  pattern  set 
by  Neyman  (Ref  21:28)  and  included,  as  Neyman  had  suggested, 
nasalized  vowel  prototypes  and  some  additional  prototypes 
for  the  beginning  and  ending  sounds  (Ref  21:74).  Another  set 
of  prototypes  was  chosen  using  the  same  procedure  except 
that  the  vowels  were  uttered  in  isolated  context.  The  sound 
lasted  over  a two-second  interval  and  the  vowel  prototype 
was  selected  from  the  most  uniform  area  of  the  spectrogram. 

The  consonants  for  this  set  were  chosen  from  deliberate  speech 
with  carefully  enunciated  words  in  the  hope  of  capturing  the 
essence of each sound. The third set of prototypes was selected
from normal rate of speech sentences with no attempt to
modify  the  speaker's  speech  pattern.  This  set  of  prototypes 
limited  each  phoneme  to  the  same  duration.  The  basis  for  this 
selection  was  that  vowel  sounds  can  be  located  consecutively 
for long vowels, that each part of a diphthong can be located
separately  and  restructured  by  context  rules,  and  that  the 
uniform  duration  chosen  for  the  prototypes  was  no  shorter 
than the shortest sound that can occur. This last set
selection was necessary to accommodate the spatial filtering
that  is  discussed  in  Chapter  IV.  The  results  obtained  from 
the  use  of  the  different  prototype  sets  are  discussed  in 
Chapter  V. 


IV.  Recognition  Processing 


The  recognition  phase  operates  on  the  m x 16  arrays  of 
digital  data  and  includes  all  tasks  that  are  performed  on  the 
data  in  order  to  complete  the  phoneme  recognition.  The  upper 
limit  on  m is  500.  This  allows  an  utterance  with  a duration 
of  approximately  6.25  seconds. 

Neyman's  recognition  scheme  (Ref  21:30-46)  was  judged 
to  be  exceptionally  well  designed  and  was  changed  only  where 
necessary  to  accommodate  the  filtering  routine  and  the 
increased  prototype  set.  As  in  the  original  program,  after 
prototype  selection,  the  complete  program  from  microphone  to 
decision  print  out  could  easily  be  converted  to  near  real 
time  if  desired.  However,  to  aid  in  the  manual  analysis  of 
the data, the normalized and non-normalized versions of the
spectrograms  were  generated.  The  revised  program  as  used  in 
this  research  is  included  in  Appendix  B. 

Normalization 

Normalization, an extremely important concept in speech
recognition, is used to help minimize some of the many variations
that occur in speech. Using normalization techniques
enables the use of fewer templates or special rules to represent
a speech sound faithfully. These techniques include
normalization by (1) velocity, (2) amplitude, (3) time,
(4) speaker spectra, (5) dynamic range, and (6) noise subtraction
(Ref 28:51). Each of these terms is explained in
Appendix C.

In  some  cases  normalization  might  actually  mislead.  One 
example  occurs  in  faster  speech  where  articulatory  targets 
are  less  likely  to  be  reached  than  in  slower  speech.  When 
the  faster  speech  is  time-stretched,  the  target  values  reached 
will  still  have  different  values  from  those  obtained  by  slower 
speech  and  might  lead  to  the  identification  of  the  wrong 
phonemes  (Ref  5:761). 

One of the most obvious needs for normalization is the
requirement for something similar to an automatic gain control.
Under this amplitude normalization, the phoneme prototypes
and the input word/sentence data were unit normalized
for each frame. Each component of every frame is normalized
by the formula

    x_{nj} = \frac{x_j}{\left[ \sum_{i=1}^{16} (x_i)^2 \right]^{1/2}}        (1)

where x_{nj} is the normalized jth component of a frame and i
is used to index all the components of a frame.




To minimize the possibility of non-information bearing
intervals and intentional stops in the speech utterance being
changed to the point of entering the decision scheme, Neyman
checked each frame by the following rule (Ref 21:31)

    \sum_{i=1}^{16} (x_i)^2 < 0.5        (2)

If the inequality was satisfied, the vector was not normalized.
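
The frame normalization and the low-energy check can be sketched together. The sketch assumes that "unit normalized" means scaling each frame to unit Euclidean length, which is consistent with the sum of squares tested in Eq (2) but is an assumption here; the function name is hypothetical.

```python
import math

ENERGY_FLOOR = 0.5   # threshold on the sum of squares, from Eq (2)


def normalize_frame(frame, floor=ENERGY_FLOOR):
    """Scale a frame to unit length unless its energy falls below the floor."""
    energy = sum(x * x for x in frame)
    if energy < floor:
        return list(frame)          # quiet frame: Eq (2) satisfied, leave as is
    norm = math.sqrt(energy)
    return [x / norm for x in frame]


loud = normalize_frame([3.0, 4.0])     # energy 25, so the frame is rescaled
quiet = normalize_frame([0.1, 0.2])    # energy 0.05, so the frame is unchanged
```

A frame that passes the check ends up with a sum of squares equal to one, so loud and soft renditions of the same spectral shape become comparable.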

A unit normalization was performed next on each prototype
to ensure that prototypes with excessive energy did not
falsely  correlate  with  higher  values  than  the  true  weaker 
energy  terms.  The  normalization  that  was  used  was 

length  of  prototype  (3) 

since  the  individual  frames  had  been  previously  normalized 
by Eq (1).

One  problem  that  was  discovered  by  using  the  rules 
implied  by  Eqs  (2)  and  (3)  was  that  some  unvoiced  fricatives 
and  stops  did  not  have  every  frame  normalized.  This  resulted 
in  an  apparent  loss  of  energy  that  caused  these  quiet  sounds 
to  have  weak  correlation  values  and  little  chance  of  being 
selected in the phoneme selection phase. To remedy this
problem, two spectrograms were printed. One used the rule of
Eq (2) while the other did not. Figure 2 shows the difference
between the two spectrograms. The actual method of using
these spectrograms to aid in the decision process is discussed
in Chapter V.


Correlation 

The  "heart"  of  this  recognition  process  is  the  correlator. 
Basically  no  changes  were  made  to  the  Neyman  correlator  (Ref 
21:33-38). The method Neyman chose to accomplish the correlation
was to use the discrete Fourier transform. The actual
fast Fourier transform algorithm used was known as Fourt (Ref
10). The two-dimensional crosscorrelation of the model prototypes
with the unknown sentence data was accomplished by
taking the two-dimensional discrete Fourier transform of
both the prototypes and the sentence data. The conjugate of
one array of transformed data was found, and point-by-point
multiplication of this new array with the other transformed
array yielded a third array. The inverse transform of the
third array produced the correlation coefficients.

In  order  to  avoid  the  problem  of  "end  effect"  that 
occurs  with  correlation  using  discrete  transforms,  Neyman 
embedded each of the data arrays in zeros before the transform
was  performed  (Ref  21:35-36). 
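
The transform, conjugate, multiply, and inverse-transform sequence can be illustrated on small arrays. This sketch is not the Fourt routine: it uses a direct two-dimensional DFT for clarity, the function names are hypothetical, and the array sizes are chosen only to keep the example fast. Both arrays are embedded in zeros so that the circular correlation does not wrap around.

```python
import cmath
import math


def dft2(a, inverse=False):
    """Direct 2-D discrete Fourier transform (slow, for illustration only)."""
    rows, cols = len(a), len(a[0])
    sign = 1 if inverse else -1
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            acc = 0j
            for r in range(rows):
                for c in range(cols):
                    angle = 2 * math.pi * (u * r / rows + v * c / cols)
                    acc += a[r][c] * cmath.exp(sign * 1j * angle)
            out[u][v] = acc / (rows * cols) if inverse else acc
    return out


def embed_in_zeros(a, rows, cols):
    """Place array a in the corner of a larger array of zeros."""
    out = [[0.0] * cols for _ in range(rows)]
    for r, row in enumerate(a):
        for c, value in enumerate(row):
            out[r][c] = value
    return out


def cross_correlate(prototype, data, rows, cols):
    """corr[k][l] = sum over r, c of prototype[r][c] * data[r + k][c + l]."""
    p = dft2(embed_in_zeros(prototype, rows, cols))
    d = dft2(embed_in_zeros(data, rows, cols))
    product = [[p[u][v].conjugate() * d[u][v] for v in range(cols)]
               for u in range(rows)]
    return [[z.real for z in row] for row in dft2(product, inverse=True)]


# Six frames of two channels; the prototype equals frames 3 and 4.
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.6, 0.8], [0.8, 0.6], [0.0, 0.0]]
prototype = [[0.6, 0.8], [0.8, 0.6]]
corr = cross_correlate(prototype, data, rows=8, cols=4)
best_frame = max(range(5), key=lambda k: corr[k][0])
```

The correlation at zero channel shift peaks at the frame where the prototype was taken from, and because each prototype frame has unit length the peak value equals the number of frames in the prototype.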
The mechanics of the correlation sequence are given by
Neyman (Ref 21:36-38). The largest section that could be
transformed at one time using Neyman's scheme was 48 frames
of original input data. This limitation can be changed consistent
with the constraints of the Fourt routine. An overlap
of eight was used between sections to solve the problem with
larger prototypes that did not have sufficient space to effect
a complete correlation sequence. The values of the arrays
are defined in such a manner that the correlation coefficients
that are printed agree with the frame numbers that are
printed alongside the coefficients. A correlation vector
was computed for each of the prototypes. Following the decision
process, the sequence is repeated for the next speech
segment.
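
The division of an utterance into overlapping sections can be sketched as below. It is assumed here that the overlap of eight is measured in frames and that each new section begins eight frames before the end of the previous one; the actual bookkeeping in the program of Appendix B may differ, and the function name is hypothetical.

```python
def section_bounds(total_frames, section_len=48, overlap=8):
    """Return (start, end) frame indices of successive overlapping sections."""
    step = section_len - overlap
    bounds, start = [], 0
    while True:
        end = min(start + section_len, total_frames)
        bounds.append((start, end))
        if end == total_frames:
            return bounds
        start += step


sections = section_bounds(100)
```

For a 100-frame utterance this gives three sections, each sharing eight frames with its neighbor, so a prototype that straddles a section boundary is still seen whole in one of them.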
Phoneme Location

The  first  process  in  the  decision  strategy  was  to  find 
possible  areas  of  phoneme  occurrence  in  a sentence  segment. 
To  facilitate  this  decision,  it  was  necessary  to  insure  that 
the correlation value was high enough to warrant consideration.
In order to determine the maximum correlation value
obtainable,  the  prototypes  were  autocorrelated.  Since  the 
prototypes  and  speech  data  had  been  normalized,  the  maximum 
value was a function of prototype length. The maximum value
that could be obtained was found by Neyman to be

    z_{max} = [(4.19 \times 10^{?})(Q)]        (4)

where

    z_{max} = maximum correlation
    Q = number of frames defining the prototype

A phoneme was considered to exist if the correlation value
z_i satisfied the following inequality

    z_i \geq C z_{max}        (5)

The value of C was chosen to be 0.86 (Ref 21:38-39).

In Neyman's program there were differing data range values
for each level of the decision process, i.e., the maximum
number specified by Eq (4) existed at the correlation
level and was transformed by a normalizing factor (X NORM)
for the prototype vector. None of the arrays contained the
actual correlation value in a manner that was easy to use.
A change was made that combined the normalization factors of
Eq (4) and X NORM. This caused all the array values
to fall in the range of 0.1 - 1.0, with the latter representing
autocorrelation. On the basis of empirical results,
the value of 0.86 was still considered a good threshold value.

Another  factor  that  had  to  be  considered  in  accepting 
a candidate  phoneme  was  the  number  of  times  it  occurred  in 
a short  segment  of  speech.  If  additional  occurrences  were 
to  be  considered,  they  were  required  to  have  a correlation 
value  of  greater  than  96  percent  of  the  prime  location  value. 

The third area of consideration was to ensure that additional
locations fell outside the duration established for
the  prototype  being  correlated.  This  was  done  by  considering 
high  correlation  values  near  the  original  maximum  to  be  part 
of  that  occurrence  of  the  phoneme. 

Once  the  candidate  areas  were  selected,  the  rest  of  the 
vector  was  set  equal  to  zero.  The  maximum  value  of  correla- 
tion was  put  into  the  vector  a number  of  times  corresponding 
to  the  prototype  size. 
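
The acceptance rules described above can be sketched in Python as
follows. This is a minimal sketch under stated assumptions, not the
program of Appendix B: the name `select_candidates`, its parameters,
and the choice to start each plateau at the accepted frame are
hypothetical. It also assumes that the correlation vector has already
been scaled so that 1.0 represents autocorrelation, and that
additional occurrences need only satisfy the 96 percent rule.

```python
def select_candidates(corr, proto_len, threshold=0.86, secondary_ratio=0.96):
    """Return a vector that is zero except for plateaus marking accepted
    occurrences of one prototype in a short segment of speech.

    corr       -- normalized correlation values, one per frame
    proto_len  -- prototype size Q in frames
    """
    out = [0.0] * len(corr)
    if not corr:
        return out

    # Prime location: the frame with the largest correlation value.
    prime = max(range(len(corr)), key=corr.__getitem__)
    if corr[prime] < threshold:
        return out  # the phoneme is not considered to exist here

    accepted = [prime]
    # Visit the remaining frames from highest to lowest correlation.
    for i in sorted(range(len(corr)), key=corr.__getitem__, reverse=True):
        if corr[i] <= secondary_ratio * corr[prime]:
            break  # all later frames are lower still
        # High values within one prototype duration of an accepted
        # frame are treated as part of that same occurrence.
        if all(abs(i - j) >= proto_len for j in accepted):
            accepted.append(i)

    # Zero everywhere else; write each maximum proto_len times
    # (assumption: the plateau starts at the accepted frame).
    for j in accepted:
        for k in range(j, min(j + proto_len, len(corr))):
            out[k] = corr[j]
    return out
```

With a prototype of three frames, a vector whose peaks are 0.90 and
0.87 four frames apart would yield two plateaus, while a neighboring
value of 0.88 next to the 0.90 peak would be absorbed into the first
occurrence.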

Phoneme  Classification 

The program, as listed in Appendix B, selects the phoneme
based on the magnitudes of the prototype vectors in the final
array. The overlap allowed between prototypes is variable
in the program. The following scheme was used for this
analysis

$$ \text{overlap} = \begin{cases} 1, & Q \le 8 \\ 9, & Q \le 11 \\ 12, & Q \le 15 \end{cases} \tag{6} $$

where

Q = prototype size


The  correlation  coefficient  arrays  were  also  useful  for 
studying  areas  where  incorrect  decisions  had  been  made  to 
determine  if  the  correct  phoneme  had  been  located. 


Filtering

Spatial filtering techniques were used by Daily and
Sutton (Ref 4) to improve the recognition of isolated words.
The same type of filtering was used in the prototype matching
process. The decision was made to use a variable length filter
inserted in the FFT where correlation was being performed.
The FFT array contains 64 x 32 complex terms. An easy filter
to implement consisted of replacing unwanted terms with zeros.
The dimensions of the filter were varied by changing two
integer variables. When no filter was desired, these variables
were set to their maximum values of 64 and 32.
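
A zero-replacement filter of this kind might look like the following
Python sketch. The text does not state which terms of the array were
treated as unwanted, so the sketch assumes that the retained terms
form the leading `keep_rows` x `keep_cols` block; the function name
and parameter names are hypothetical.

```python
def apply_zero_filter(fft, keep_rows, keep_cols):
    """Return a copy of a 2-D list of complex FFT terms in which every
    term outside the leading keep_rows x keep_cols block is zero.

    Assumption: the retained block is the leading corner of the array.
    """
    return [
        [term if (r < keep_rows and c < keep_cols) else 0j
         for c, term in enumerate(row)]
        for r, row in enumerate(fft)
    ]


# A 64 x 32 array of complex terms, as described in the text.
fft = [[complex(1, 1)] * 32 for _ in range(64)]

unfiltered = apply_zero_filter(fft, 64, 32)   # maximum values: no filtering
filtered = apply_zero_filter(fft, 15, 15)     # only a 15 x 15 block survives
```

Setting both integers to the array dimensions leaves every term in
place, which matches the "no filter" case described above.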

Experiments with the filter revealed an incompatibility
with the normalization scheme that had been used without
filtering. A normalization factor had been included to bring
all the correlation values back to the same general
magnitude after correlation so that comparison type decisions
could be made. The filter removes energy from the
correlation process and this causes the normalization factors to be
incorrect. No easy method exists to change the normalization
factors since they would have to change with each filter
dimension change. The solution was to use a different
normalization procedure.

Filter  Normalization 

The prototype and sentence data were still normalized
by time frame as before to serve as an automatic gain control.
The unit normalization process of the prototype was relocated
to the FFT array. Since each component of this array was
complex, the normalization consisted of dividing each term
of the FFT array by Eng where

$$ \mathrm{Eng} = \left[ \sum_{i=1}^{64} \sum_{j=1}^{32} R_{ij}^{2} + \sum_{i=1}^{64} \sum_{j=1}^{32} I_{ij}^{2} \right]^{1/2} \tag{7} $$

and $R_{ij}$ and $I_{ij}$ represent the real and imaginary terms of
the array. The normalization factor that is used with this
method was found empirically to be

$$ \mathrm{Good} = (175) \left( \frac{15}{Q} \right) \tag{8} $$

where

Q = prototype length

The 15/Q relationship existed because the maximum prototype
length was 15, and with a prototype of this length, the
maximum value for autocorrelation was 175. This value was stored
in the array "Good" and is the single normalization factor
in this modified program. The filter and normalization are
included in the computer program of Appendix B, and the
results of their use are discussed in Chapter V.
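
The unit normalization of the FFT array can be illustrated with a
short Python sketch. It assumes that Eng is the square root of the
summed squared real and imaginary terms; the function name
`unit_normalize` and the handling of an all-zero array are
hypothetical choices made for illustration.

```python
import math


def unit_normalize(fft):
    """Divide every complex term of a 2-D list by Eng, the square root
    of the sum of all squared real and imaginary parts.

    Returns the normalized array and Eng. An all-zero array is returned
    unchanged (an assumption; the text does not cover this case).
    """
    eng = math.sqrt(sum(t.real ** 2 + t.imag ** 2 for row in fft for t in row))
    if eng == 0.0:
        return [row[:] for row in fft], eng
    return [[t / eng for t in row] for row in fft], eng


# Tiny worked case: a single non-zero term 3 + 4j gives Eng = 5.
normalized, eng = unit_normalize([[3 + 4j, 0j], [0j, 0j]])
total_energy = sum(abs(t) ** 2 for row in normalized for t in row)
```

After the division the total energy of the array is one, so that any
energy removed by a later filter lowers the attainable correlation
rather than invalidating a precomputed scale factor.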

V.  Results


The results are presented in three phases. The first
attempts to duplicate the work of Neyman and includes an
extended prototype set as Neyman suggested. In phase two
an attempt is made to improve the work of phase one by
correcting an error in the original Neyman program. In phase
three the results of spatial filtering combined with the
necessary program modification are presented. The result
phases are preceded by a discussion of rating results.

Scoring  Philosophy 

Existing ratings of the results of recognition of various
types of speech signals, as a rule, are based on the value
p = (m/n) x 100%, where m is the quantity of correctly
identified patterns and n is the quantity of patterns presented
(Ref 6:9). Even though this rule is generally used to
measure the recognition rate of speech understanding systems,
there are many other measures that could be defined if desired
that would cause the ratings to differ. For this reason, it
is very important to ensure that the exact method of scoring
is understood.

Unlike the results of Neyman (Ref 21:47-72), in these
results a complete set of prototypes is assumed. This
assumption is rather poor in the last phase of this chapter
but  was  made  to  reflect  a more  meaningful  score.  Another 
measure  that  is  used  to  reflect  the  secondary  quality  of  a 
recognition  system  is  that  of  "location."  Location  in  the 
broadest  sense  means  to  accurately  state  the  time  in  a 
particular  speech  segment  in  which  a phoneme  occurred,  given 
that  it  occurred.  It  is  important  to  realize  that  without 
location,  there  can  be  no  recognition.  This  broad  view  of 
location  was  used  in  the  analysis  of  results  in  this  study. 

The  actual  method  of  analysis  also  warrants  attention. 
Generally,  a fixed  set  of  phonemes  is  expected  and  this  set 
is  looked  for.  Scoring  is  based  on  the  success  in  finding  the 
members  of  the  predicted  list.  In  the  last  phase  of  results 
it  becomes  more  meaningful  to  see  what  was  predicted  before 
deciding  what  phonemes  should  be  looked  for  because  in  many 
cases there is no unique combination of phonemes for a
particular utterance.

Although  recognition  can  be  substantially  improved  by 
training  the  prototypes,  training  requires  a lot  of  manual 
processing  time.  In  order  to  keep  the  selection  process 
adaptable  for  real-time  use,  no  training  of  prototypes  was 

Expanded  Test  Set

The  first  phase  of  results  consisted  of  expanding  the 
Neyman  program  from  47  to  61  prototypes.  The  additional 
prototypes  consisted  of  additional  ending  or  beginning 
versions  of  sounds  and  nasalized  vowels  that  had  not  been 
included  in  the  Neyman  set.  All  the  test  words  were  embedded 
in  the  sentence  structure  "Say  (word)  Instead." 

Just as Neyman's recognition rate dropped as he
expanded from a 10 class to a 47 class problem, the recognition
rate for a 61 class problem fell below that achieved by
Neyman's 47 class problem. Location percentage remained high
but the increased prototype set had a larger overlap region
in the decision space as was evident from multiple prototype
locations for a single frame. Two useful facts found during
this phase were (1) combination sounds such as "st" were
identified 100 percent of the time, and (2) diphthongs can be
split into "short vowel-transition-short vowel" phonemes for
the decision segment and recombined later.

During this first phase it became apparent that an error
had existed in the preliminary signal processing throughout
Neyman's analysis and much of this current research. The
problem, an incorrect shift and sample procedure within a
buffer stage, essentially resulted in one 128 sample segment
being used four times while the next three 128 sample segments
were discarded. At this point the decision was made to
repeat the entire process in the hope of getting better
recognition results.
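
The effect of the buffer error can be illustrated with a small Python
sketch. It reproduces only the described outcome (one 128 sample
segment used four times, the next three discarded); it is not the
original shift and sample code, and both function names are
hypothetical.

```python
def segments_correct(samples, seg_len=128):
    """Split the sample stream into consecutive seg_len-sample segments."""
    count = len(samples) // seg_len
    return [samples[k * seg_len:(k + 1) * seg_len] for k in range(count)]


def segments_faulty(samples, seg_len=128):
    """Mimic the described error: every group of four segment slots is
    filled with the first segment of that group, so the other three
    segments never reach the later processing stages."""
    good = segments_correct(samples, seg_len)
    return [good[(k // 4) * 4] for k in range(len(good))]


samples = list(range(128 * 8))        # eight segments of stand-in data
good = segments_correct(samples)
faulty = segments_faulty(samples)
```

In the faulty framing the first four slots all hold the same segment,
and the fifth slot jumps ahead to sample 512.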

Corrected  Test  Set 

The entire process, through sampling, prototype selection,
and decision stage, was repeated. The results reflected
almost no improvement over those obtained in the previous
section. This was not entirely unexpected because the
Nyquist sampling rate had not been violated and the speech
signal had been only slightly modified. The possibility
exists that pseudo-filtering of this nature might be more
beneficial than harmful. It has also been observed that speech
signals can undergo considerable distortion without becoming
unintelligible (Ref 16:536).

After the system had been tested, it became apparent
that no substantial improvement had been made over the
original Neyman system. The idea of spatial filtering grew more
appealing as it seemed a maximum recognition rate had been
achieved with the present system.


Pre-Filtered  Test  Set 


As discussed in Chapter IV, before filtering could be
accomplished, it was necessary to change the normalization
scheme. Once the filter was designed and the normalization
repeated, prototypes were autocorrelated to ascertain
the maximum correlation value attainable. It was discovered
that unvoiced sounds of low energy content would not correlate
to the same level as a similar voiced sound. This was
because of a segmentation rule that had been used to prevent
normalization of frames having less energy than a fixed amount.
Removing this restriction from the program resulted in an
almost perfect correlator. Everything that went into the
correlator came out just as it occurred. For instance, if
the word "church" had been pronounced "ch-ur-ch", the
correlation scheme would print "ch h ur ch h" with the extra h's
representing the unintentional aspiration that occurred as
a result of strong enunciation. The only problem with this
correlation scheme is the segmentation problem. The two
different spectrograms that represent this situation are
shown in Figure 2.

Figure 2(a) shows "energy groups." These groups in this
case just happen to be words. In other examples, the groups
represent individual syllables. In either case, the groups
contain at least one vowel.

Figure 2(b) shows every sound that occurred including
throat noise, lip noise, and breathing. A rule was used
such that (1) all the groups of Figure 2(a) existed, and
(2) at most each group could have associated with it six
frames on either side of the group. Markoul used a similar
device for detection of silence and gaps (Ref 1:249). The
following rules help to solve ambiguities (Ref 25:85- 4):

1)  Fricatives  often  have  a short  dip  in  energy  at  the 
start  of  frication. 

2)  A short  nasal  is  often  marked  by  a short  drop  in 
energy. 

3)  A silent  segment  followed  by  a noisy  segment  can  be 
either  a plosive  followed  by  a fricative,  or  the 
whole  sequence  can  be  an  aspirated  plosive. 
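
The segmentation rule stated before this list (keep every energy
group and allow each group at most six frames on either side) can be
sketched in Python. This is a sketch under assumptions: energy groups
are taken to be runs of frames at or above a fixed energy floor, a
margin is not allowed to run into a neighboring group, and the names
`energy_groups` and `extend_groups` are hypothetical.

```python
def energy_groups(energy, floor):
    """Return (start, end) pairs, end exclusive, for each run of frames
    whose energy is at or above floor."""
    groups, start = [], None
    for i, e in enumerate(energy):
        if e >= floor and start is None:
            start = i
        elif e < floor and start is not None:
            groups.append((start, i))
            start = None
    if start is not None:
        groups.append((start, len(energy)))
    return groups


def extend_groups(groups, n_frames, margin=6):
    """Widen each group by up to margin frames on either side, clipped
    to the utterance and to the edges of neighboring groups."""
    extended = []
    for k, (start, end) in enumerate(groups):
        lo = max(0, start - margin)
        hi = min(n_frames, end + margin)
        if k > 0:
            lo = max(lo, groups[k - 1][1])      # stop at previous group
        if k + 1 < len(groups):
            hi = min(hi, groups[k + 1][0])      # stop at next group
        extended.append((lo, hi))
    return extended


# Two groups separated by a three-frame gap in a 32-frame utterance.
energy = [0] * 10 + [5] * 4 + [0] * 3 + [5] * 5 + [0] * 10
groups = energy_groups(energy, floor=1)
windows = extend_groups(groups, len(energy))
```

Low-energy frames inside a window remain available to the correlator,
while frames farther than six frames from any group are ignored.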

Although the main impetus of this research concerned
single-speaker recognition, the success of Daily and Sutton
with spatial filtering (Ref 4) suggested the attempt at
multiple speaker recognition. Seven sentences were recorded
three times each by three separate speaker subjects. Speaker
A was a male with a southern accent, Speaker B was a male with
a mid-eastern accent, and Speaker C was a female with a
southern accent.

The spectrograms of the word "vitamins" spoken by each speaker
show remarkable similarity as is displayed in Figure 3. This
across-speaker invariance in the spectrogram representation
of speech might be the basis of a speaker-independent
recognition algorithm.

The  first  lesson  that  was  gained  from  this  portion  of 
the  experiment  was  that  prototypes  chosen  from  deliberately 
pronounced  words  had  little  chance  of  correctly  correlating 
with  those  of  normal  speech.  The  main  problem  seemed  to  be 
the  length  of  the  prototypes  as  they  occurred  in  slow  speech 
compared  to  the  normal  shortened  speech.  The  normalization 
that was being used did not rectify the problem. Another
problem, which has already been alluded to, was the correlation
of shorter prototypes with noisy segments of longer
prototypes. This last problem worsened when spatial filtering was
attempted because the longer prototypes had more energy
removed from them by filtering than did the shorter prototypes.
These  problems  motivated  the  selection  of  uniform  length 
prototypes.  These  prototypes  were  taken  from  the  seven 
sentences  that  were  recorded  by  Speaker  A.  The  sentences 
that  were  used  for  this  were  not  used  in  compiling  recognition 
results  as  this  would  not  be  an  unbiased  scoring. 


Fig. 3. Spectrograms for "Vitamins" by Three Speakers:
(a) Speaker A, (b) Speaker B, (c) Speaker C

In order to keep the prototype set at 61, 17 of the
prototypes that came from continuous utterances of vowels
were retained. The complete set is listed in Table III.
There is redundancy in the selection and there are sounds that
are not included; however, the scoring is made as though a
complete set existed. In some cases the longer prototypes
did correlate highly, but when they did it generally was a
case of agreement. One good example of this is the diphthong
prototype "AE"* from "hate" correlating higher with the
"TA - TE"* combination from "taste". Table VI describes the
symbols used for analysis of sentence data.

Table VI

Sentence Analysis Symbols

Symbol   Definition

Blank    No symbol (or a blank) indicates that this was
         the accepted, recognized phoneme

L        An "L" indicates that although not recognized, the
         maximum value of correlation indicated proper
         location

X        An "X" indicates that this phoneme was not located

* Symbols used are listed in Table III.
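
The scoring implied by these symbols, combined with the rating
p = (m/n) x 100% from the Scoring Philosophy section, can be sketched
in Python. The sketch assumes that a recognized phoneme also counts
as located, which follows the statement that without location there
can be no recognition; the function name `score` is hypothetical.

```python
def score(symbols):
    """Return (location %, recognition %) for one analyzed utterance.

    symbols -- one entry per expected phoneme: "" or " " (recognized),
               "L" (located but not recognized), "X" (not located)
    """
    n = len(symbols)
    if n == 0:
        raise ValueError("no phonemes to score")
    recognized = sum(1 for s in symbols if s.strip() == "")
    located_only = sum(1 for s in symbols if s.strip().upper() == "L")
    location = 100.0 * (recognized + located_only) / n
    recognition = 100.0 * recognized / n
    return location, recognition


# Four expected phonemes: two recognized, one located only, one missed.
location, recognition = score(["", "L", "X", ""])
```

Here the location rating is 75 percent and the recognition rating is
50 percent.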

Table VII through Table XIV display the analysis of
each sentence spoken by Speaker A and Table XV through
Table XXII contain the analysis of one of each of the
seven sentences spoken by Speaker B.

*Rest of sentence not digitized.

Filtered  Test


In  order  to  establish  the  filter  size  and  whether  to 
normalize  before  filtering  or  after,  several  test  filters 
were  used.  These  filter  results  are  presented  in  Table  XXIII. 


Table XXIII

Results With Filtering

Filter      Number of    Phoneme         Phoneme
Size        Phonemes     Location (%)    Recognition (%)

Filter before Normalization

5 x 13          7            29               0
7 x 7          15            80              40
9 x 13         15            93              33
7 x 15         16            81              25
15 x 7         16            88              50
15 x 15        16           100              63
17 x 33        16            94              50
17 x 64        15            93              47
32 x 33        16           100              63

Filter after Normalization

15 x 15        16           100              19
25 x 45        16           100              56



Filtering  after  normalization  gave  the  same  location  as 
filtering  before;  however,  the  correlation  magnitudes  were 
greatly  reduced.  Filtering  before  normalization  caused  the 
correlation  coefficients  to  greatly  increase  and  crowded 
the  decision  space.  The  filter  chosen  for  filter  analysis 
was  the  25  x 45  filter  placed  after  normalization.  The 
energy  lost  with  this  large  filter  was  minimal  but  seemed 
to  maximize  phoneme  location.  Table  XXIV  presents  the 
filtered  analysis  of  five  of  the  sentences  of  Speaker  A 
and  Table  XXV  presents  a summary  of  the  analysis. 

Table  XXVI  contains  a filter  analysis  of  three  sentences 
of  Speaker  B. 

No improvement was gained by filtering for Speaker C.
Only one sentence, "Vitamins taste good," was analyzed for
Speaker C and location was rated at 89 percent and
recognition at 19 percent.


VI.  Conclusions  and  Recommendations 


The objective of this study was to try to solve the
phoneme recognition problem by computer analysis of
continuous speech. More directly, the objective was to take the
basic program developed by Neyman, to subject it to further
testing, and by incorporating meaningful modifications, to
improve its operation.

For sentence analysis, Neyman achieved location of
92 percent and identification of 34 percent for a single
speaker and an incomplete set of prototypes. Assuming a
complete set of prototypes and using uniform length
prototypes and different normalization techniques, this program
increased identification by 11 percent while location was
decreased by 11 percent. Using an additional speaker,
recognition was still 6 percent better, but location dropped
by 21 percent.

Filtering was investigated next with the overall effect
of improving location while slightly decreasing
identification. The identification was degraded more for Speaker A.
This is related to the fact that the prototypes came from
Speaker A while the filter size analysis was performed on
Speaker B data. Daily and Sutton found that spatial filtering
designed for one speaker was not best suited for the other
speaker or for both speakers together (Ref 4:36).

The  feature  that  shows  the  greatest  promise  of  success 
is  the  use  of  uniform  length  prototypes.  Speech  segments 
that  are  longer  than  the  prototypes  can  have  consecutive 
identification  periods.  This  allows  the  use  of  prototypes 
for  transitions.  Figure  4 (Ref  8:60-61)  suggests  that  there 
might  be  as  many  as  300  prototypes  to  cover  the  vowels, 
consonants,  and  interphonemic  transitions.  Even  this  would 
be  an  acceptable  solution  if  it  would  offer  substantially 
higher  recognition  results. 

The increased identification attained by this program
continues to warrant further study. The missing phonemes
should be added to the prototype set in place of phonemes that
prove to be redundant. The redundancy should be identified
by actual correlation tests so as not to destroy unique
prototypes. The correlation process obtains best results
when the prototypes are taken from actual speech. The
warning that must be issued here is that there is a high degree
of probability of selecting portions of adjacent phonemes,
as is surely the case in a few of the existing prototypes
of this program.

Spectrograms of Vowel-Consonant Combinations


If  female  voices  are  to  be  used  with  male  prototypes, 
frequency  normalization  of  some  type  will  be  necessary. 

From  observing  Figure  3(c),  it  is  evident  that  substantial 
improvement  could  be  gained  by  setting  the  first  two  frequency 
channels  of  both  the  prototypes  and  the  sentence  data  to 
zero  since  they  are  missing  from  the  female  voice. 

1. Aliasing: The term "aliasing" refers to the fact that
high-frequency components of a time function can
impersonate low frequencies if the sampling rate is too low.

2. Allophone: The variant forms of a phoneme as conditioned
by position or adjoining sounds.

3. Amplitude Normalization: The removal of speech amplitude
as a parameter in speech sound similarity measurement.
This ensures that a sound that varies in energy but not
in spectral composition is still interpreted as the same
sound (Ref 28:51).

4. Diphthong: A combination of two vowels in the same
syllable, in which the speaker glides continuously from
one vowel to another.

5. Dynamic Range Normalization: The determination of the
energy variations of speech in order to adjust thresholds
to allow energy to be used in segmentation (Ref 28:51).

6. Frame: A single time increment of the digital
spectrogram.

7. Fricatives: Sounds produced by partial constriction
along the vocal tract which results in turbulence. The
sounds can be further subdivided into voiced and
unvoiced categories. The voiceless fricatives are
produced as a result of frictional modulation. The voiced
fricatives combine frictional with vocal cord and cavity
modulation.

8. Leakage: The term "leakage" refers to the discrepancy
between the continuous and discrete Fourier transforms
caused by the required time domain truncation.

9. Morpheme: Any of the minimum meaningful elements in a
language, not further divisible into smaller meaningful
elements, usually recurring in various contexts with
relatively constant meaning, such as a word.

10. Nasals: Sounds that are produced by allowing the air to
flow through the nasal cavities. Coupling the nasal
cavities to the resonance system of the vocal tract
results in nasalized vowels. If the air flow is
restricted to only flowing through the nasal cavities, nasal
consonants are produced.

11. Noise Subtraction Normalization: The determination of
the energy of ambient noise and the subtraction of that
energy from the input signal so that only the speech
signal is left (Ref 28:51).

12. Phone: An individual speech sound.

13. Phoneme: The smallest distinctive group or class of
phones in a language. In a very general sense, the
phonemes that make up a speech sound can be compared to
the letters that make up a written word.

14. Pitch: The pitch of a sound with a periodic wave form
- i.e., a voiced sound - is determined by its fundamental
frequency, or rate of repetition of the cycles of air
pressure.

15. Plosives: Sounds that are produced by a sudden release
of built up air pressure. The sounds can be further
distinguished by the presence or absence of voicing. A
voiceless stop occurs when the stop is combined with
fricative modulation. A voiced stop occurs when vocal
cord modulation is combined with stop and fricative
modulation.

16. Pragmatic Knowledge: A record of changes in the listener's
world model occurring in the course of a conversation.

17. Prosodic Knowledge: Imputes meaning to the variation in
pitch or stress in phrases.

18. Semantic Knowledge: General knowledge about the domain
of discourse.

19. Speaker Spectra Normalization: The transformation of the
power spectral density function in order to remove the
effects of differing vocal tract lengths (Ref 28:51).

20. Syntactic Knowledge: A set of rules specifying legal
sequences of words or similar units.

21. Time Normalization: The stretching or shrinking of the
length of time elapsed between given speech segments
(Ref 28:51).

22. Velocity Normalization: Shortening of steady state
speech segments to remove artificial variations in sound
duration due to variations in speaking rate (Ref 28:51).

23. Vowels: Sounds whose source of excitation is the glottis.
During vowel production, the vocal tract is relatively
open and the air flows over the center of the tongue,
causing a minimum of turbulence. The phonetic value of
the vowel is determined by the resonances of the vocal
tract, which are in turn determined by the shape and
position of the tongue and lips.


